Computerized data mining system and program product

ABSTRACT

Under the present invention, a data exploration system, a customized model system and an existing model system are provided. The data exploration system analyzes user data to identify statistical information such as data distribution, data relationships, data outliers and invalid or missing data values. The customized model system iteratively generates customized data mining models in parallel based on permutations of the user data, user-provided business parameters and/or a set of model generation algorithms. The existing model system provides users with a library of existing data mining models, assembled based on the business parameters, from which they can choose one or more. In any event, any customized or existing data mining models selected can be run against the user data in parallel.

CROSS REFERENCE TO RELATED APPLICATIONS

The current application is a continuation application of co-pending U.S. patent application Ser. No. 10/720,792 (Attorney Docket RSW920030187US1), filed on Nov. 24, 2003, which is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

In general, the present invention provides a computerized data mining system, method and program product. Specifically, the present invention provides a network-based system for obtaining and executing a data mining model to provide business analytics.

2. Related Art

As businesses increasingly rely upon computer technology to perform essential functions, data mining is rapidly becoming vital to business success. Specifically, many businesses gather various types of data about the business and/or its customers so that operations can be gauged and optimized. Typically, a business will gather data into a database or the like and then utilize a data mining model to analyze the data.

Unfortunately, many companies are unable to flexibly integrate data analytics into business processes because of the complexity, expense, and incomprehensibility often involved. For example, in terms of infrastructure, companies often must invest substantial resources to build data warehouses, implement servers, hire “mining experts” and IT staff to use mining software, etc. In terms of processes, companies must then spend considerable time mapping and tuning between data and mining functions. To this extent, business analysts are typically required to possess the mining domain knowledge to choose the best mining algorithm and select appropriate data. In general, there can be more than twenty technically oriented parameters to tune and map. However, in reality, business analysts might know their data and business objectives well, but might not have an in-depth knowledge of the mining algorithm and/or the tuning parameters.

In fact, very few segments in industry have the resources (human and financial) to deploy sophisticated data analytics solutions such as data mining and scoring. Basically, to deploy data mining techniques, companies have two choices: (1) acquire data mining tools and hire an industry specialist to prepare the environment and set up the tool to be used; or (2) hire external consultants to compensate for the lack of in-house skills and to avoid large investments in infrastructure. Both options are extremely expensive for most companies due to the complexity of data integration and the tight binding of complex models to the analytics process.

Heretofore, attempts have been made at automating the data mining process. No existing system, however, allows data mining models to be iteratively generated and/or executed in parallel. That is, any existing systems that provide for the generation or execution of data mining models do so one data mining model at a time.

In view of the foregoing, there exists a need for a computerized data mining system, method and program product. Specifically, a need exists for a system that can iteratively generate customized data mining models in parallel based on permutations of user data, user-provided business parameters and/or a set of model generation algorithms. A further need exists for a system that allows a user to select existing data mining models from a library of existing data mining models that is assembled based on the business parameters. Still yet, another need exists for a system that can execute multiple customized or existing data mining models in parallel.

SUMMARY OF THE INVENTION

In general, the present invention provides a computerized data mining system, method and program product. Specifically, under the present invention, a data exploration system, a customized model system and an existing model system are provided. The data exploration system analyzes user data to identify statistical information such as data distribution, data relationships, data outliers and invalid or missing data values. The customized model system iteratively generates customized data mining models in parallel based on user data, user-provided business parameters and/or a set of model generation algorithms. The existing model system provides users with a library of existing data mining models, assembled based on the business parameters, from which they can choose one or more. In any event, any customized or existing data mining models selected can be run against the user data in parallel.

A first aspect of the present invention provides a computerized data mining system, comprising: a data exploration system for receiving and analyzing user data to provide statistical information about the user data; a customized model system for generating and ranking customized data mining models, and for executing a selected customized data mining model on the user data, wherein the customized data mining models are iteratively generated in parallel based on permutations of at least one of the user data, business parameters and a set of model generation algorithms; and an existing model system for selecting at least one existing data mining model from a library of existing data mining models, and for executing the selected at least one existing data mining model in parallel on the user data.

A second aspect of the present invention provides a computerized system for generating and executing customized data mining models, comprising: a model generation system for iteratively generating the customized data mining models in parallel based on permutations of at least one of user data, business parameters and a set of model generation algorithms; a model ranking system for ranking the customized data mining models based on the business parameters, for identifying a predetermined quantity of the ranked customized data mining models, and for providing comparative data corresponding to the predetermined quantity of the ranked customized data mining models; a customized model selection system for selecting at least one customized data mining model from the predetermined quantity; and a customized model execution system for executing the selected at least one customized data mining model on the user data.

A third aspect of the present invention provides a computerized system for selecting and executing existing data mining models, comprising: a model library system for assembling a library of existing data mining models based on business parameters, and for displaying the library of existing data mining models and comparative data corresponding to the library of existing data mining models; an existing model selection system for selecting at least one existing data mining model from the library of existing data mining models; an existing model execution system for executing the at least one existing data mining model on the user data in parallel; and a model comparison system for comparing results of the execution of the at least one existing data mining model.

A fourth aspect of the present invention provides a computer-implemented method for generating customized data mining models, comprising: providing user data and business parameters; iteratively generating a plurality of customized data mining models in parallel based on permutations of at least one of the user data, the business parameters and a set of model generation algorithms; ranking the plurality of customized data mining models based on the business parameters; selecting at least one customized data mining model from the ranked plurality of customized data mining models; and executing the selected at least one customized data mining model on the user data.

A fifth aspect of the present invention provides a computer-implemented method for selecting existing data mining models, comprising: providing user data and business parameters; assembling a library of existing data mining models based on the business parameters; displaying the library of existing data mining models and comparative data corresponding to the library of existing data mining models; selecting at least one existing data mining model from the library of existing data mining models; executing the at least one existing data mining model on the user data in parallel; and comparing results of the execution of the at least one existing data mining model.

A sixth aspect of the present invention provides a data mining computer program product stored on a recordable medium, which when executed, comprises: program code for receiving and analyzing user data to provide statistical information about the user data; program code for generating and ranking customized data mining models, and for executing a selected customized data mining model on the user data, wherein the customized data mining models are iteratively generated in parallel based on permutations of at least one of the user data, business parameters and a set of model generation algorithms; and program code for selecting at least one existing data mining model from a library of existing data mining models, and for executing the selected at least one existing data mining model in parallel on the user data.

A seventh aspect of the present invention provides a program product stored on a recordable medium for generating and executing customized data mining models, which when executed comprises: program code for iteratively generating the customized data mining models in parallel based on permutations of at least one of user data, business parameters and a set of model generation algorithms; program code for ranking the customized data mining models based on the business parameters, for identifying a predetermined quantity of the ranked customized data mining models, and for providing comparative data corresponding to the predetermined quantity of the ranked customized data mining models; program code for selecting at least one customized data mining model from the predetermined quantity; and program code for executing the selected at least one customized data mining model on the user data.

An eighth aspect of the present invention provides a program product stored on a recordable medium for selecting and executing existing data mining models, which when executed comprises: program code for assembling a library of existing data mining models based on business parameters, and for displaying the library of existing data mining models and comparative data corresponding to the library of existing data mining models; program code for selecting at least one existing data mining model from the library of existing data mining models; program code for executing the at least one existing data mining model on the user data in parallel; and program code for comparing results of the execution of the at least one existing data mining model.

Therefore, the present invention provides a computerized data mining system, method and program product.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features of this invention will be more readily understood from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings in which:

FIG. 1 depicts a computerized data mining system according to the present invention.

FIG. 2 depicts an illustrative business problem interface.

FIG. 3 depicts an illustrative model goal interface.

FIG. 4 depicts the customized model system of FIG. 1.

FIG. 5 depicts a method flow diagram for generating and executing customized data mining models according to the present invention.

FIG. 6 depicts the existing model system of FIG. 1.

FIG. 7 depicts a method flow diagram for selecting and executing existing data mining models according to the present invention.

The drawings are merely schematic representations, not intended to portray specific parameters of the invention. The drawings are intended to depict only typical embodiments of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements.

DETAILED DESCRIPTION OF THE INVENTION

As indicated above, the present invention provides a computerized data mining system, method and program product. Specifically, under the present invention, a data exploration system, a customized model system and an existing model system are provided. The data exploration system analyzes user data to identify statistical information such as data distribution, data relationships, data outliers and invalid or missing data values. The customized model system iteratively generates customized data mining models in parallel based on user data, user-provided business parameters and/or a set of model generation algorithms. The existing model system provides users with a library of existing data mining models, assembled based on the business parameters, from which they can choose one or more. In any event, any customized or existing data mining models selected can be run against the user data in parallel.

It should be understood in advance that the present invention could be implemented as a “business method” such that it is provided as a subscription or profit-based system. For example, businesses wishing to generate their own customized data mining models, or to select from existing data mining models, could be charged a one-time fee or a periodic subscription fee.

Referring now to FIG. 1, a computerized data mining system according to the present invention is shown. In general, the system of FIG. 1 allows users such as user 24 to obtain a data mining model (e.g., create a customized data mining model or select an existing data mining model), and then execute that data mining model against their user data. In general, the present invention is implemented in a network environment such as over the Internet, a local area network (LAN), a wide area network (WAN), a virtual private network (VPN), etc. Accordingly, user 24 operates a client 26 to interact with server 10. To this extent, client 26 is intended to represent any type of computerized device that is capable of communicating with server 10. For example, client 26 could be a personal computer, a workstation, a laptop, a hand-held device, etc. In addition, communication between server 10 and client 26 could occur via a direct hardwired connection (e.g., serial port), or via an addressable connection that may utilize any combination of wireline and/or wireless transmission methods. Server 10 and client 26 may utilize conventional network connectivity, such as Token Ring, Ethernet, WiFi or other conventional communications standards. Moreover, connectivity could be provided by conventional TCP/IP sockets-based protocol. In this instance, client 26 could use an Internet service provider to establish connectivity to server 10.

As shown, server 10 comprises central processing unit (CPU) 12, memory 14, bus 16, input/output (I/O) interfaces 18, external devices/resources 20 and storage unit 22. CPU 12 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations. Memory 14 may comprise any known type of data storage medium including magnetic storage medium, optical storage medium, random access memory (RAM), read-only memory (ROM), and a data cache memory. Moreover, similar to CPU 12, memory 14 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms.

I/O interfaces 18 may comprise any system for exchanging information to/from an external source. External devices/resources 20 may comprise any known type of external device, including speakers, a CRT, LCD screen, handheld device, keyboard, mouse, voice recognition system, speech output system, printer, monitor/display, facsimile, pager, etc. Bus 16 provides a communication link between each of the components in server 10 and likewise may comprise any known type of transmission link, including electrical, optical, wireless, etc.

Storage unit 22 can be any system (e.g., database) capable of providing storage for data under the present invention such as user data, designated business parameters, model generation algorithms, etc. As such, storage unit 22 could include one or more storage devices, such as a magnetic disk drive or an optical disk drive. In another embodiment, storage unit 22 includes data distributed across, for example, a local area network (LAN), wide area network (WAN) or a storage area network (SAN) (not shown). Further, although not shown, additional components, such as cache memory, communication systems, system software, etc., may be incorporated into server 10. Moreover, it should be understood that client 26 will include computer components similar to those of server 10; such components have not been shown for brevity.

Shown in memory 14 of server 10 is analysis system 30, which includes data submission system 32, parameter designation system 34, data exploration system 36, customized model system 38 and existing model system 40. It should be understood that the depiction of analysis system 30 shown in FIG. 1 is intended to be illustrative only and that other variations could be implemented. For example, data submission system 32 and parameter designation system 34 could be implemented as a single “input” system. In any event, user 24 can gain access to analysis system 30 via web browser 28 on client 26. Once access is gained, user 24 will submit user data that is desired to be analyzed/mined via data submission system 32. The user data can be submitted in a flat file or through any other known means. Once the user data has been provided, user 24 can utilize data exploration system 36 to learn more “statistical” information about the user data. For example, once the user data is submitted, data exploration system 36 will analyze the data to inform user 24 of information such as data distributions, data relationships, data outliers, missing and/or invalid data values, etc. This information is extremely helpful in aiding user 24's understanding of his/her user data.
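As a minimal sketch of the kind of report data exploration system 36 might produce for one numeric column, the following Python example counts missing and invalid values, computes the mean and standard deviation, and flags outliers by z-score. The function name `explore`, the validity test and the `z_cutoff` default are assumptions made for illustration; the specification does not prescribe how outliers or invalid values are detected, and data relationships (e.g., correlations between columns) are omitted here.

```python
import statistics

def explore(column, is_valid=lambda v: isinstance(v, (int, float)), z_cutoff=2.0):
    """Summarize one column: missing values, invalid values, spread, outliers.

    Hypothetical helper: the validity test and the z-score cutoff are
    illustrative assumptions, not taken from the specification.
    """
    missing = sum(1 for v in column if v is None)
    invalid = [v for v in column if v is not None and not is_valid(v)]
    values = [v for v in column if v is not None and is_valid(v)]

    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation

    # A value is flagged when it lies more than z_cutoff deviations from the mean.
    outliers = [v for v in values if stdev > 0 and abs(v - mean) / stdev > z_cutoff]

    return {
        "count": len(values),
        "missing": missing,
        "invalid": invalid,
        "mean": mean,
        "stdev": stdev,
        "outliers": outliers,
    }

# Example: transaction amounts with one gap, one bad entry and one extreme value.
report = explore([10, 12, 11, 13, 12, 11, 100, None, "n/a"])
print(report["missing"], report["invalid"], report["outliers"])
```

A production system would likely use a more robust outlier test, since a single extreme value inflates the standard deviation it is measured against.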

Under the present invention, user 24 will then designate business parameters via parameter designation system 34. In designating business parameters, user 24 will use a series of interfaces provided by parameter designation system 34 to designate a business field/taxonomy (e.g., financial, banking, etc.), business problems (e.g., determining whether transactions are fraudulent, etc.) and goals for the eventual mining model (e.g., a misclassification cost). Referring to FIG. 2, an illustrative business problem interface 42 is shown. As depicted, interface 42 lists several problem “choices” 44 from which user 24 can select. Referring to FIG. 3, an illustrative model goal interface 46 is shown for user 24 to designate the mining model goal(s). In a typical embodiment, the mining model goal is quantified into numeric values 48 for ranking the models (which will be further described below). It should be understood that interfaces 42 and 46 are illustrative only and that other variations could be implemented.

Once user data and the business parameters have been provided, user 24 can create a customized data mining model and/or select an existing data mining model. If user 24 wishes to generate a customized data mining model, user 24 will utilize customized model system 38 (FIG. 1). Referring to FIG. 4, customized model system 38 is shown and described in greater detail. As depicted, customized model system 38 includes model generation system 50, model ranking system 52, customized model selection system 54 and customized model execution system 56. Using the user data 58, business parameters 60 and predetermined model generation algorithms, model generation system 50 will iteratively generate a plurality of data mining models in parallel. Specifically, an “iterator” within model generation system 50 will develop various permutations of the user data 58, business parameters 60 and/or the model generation algorithms (collectively referred to as model generation details), and iteratively generate multiple data mining models based thereupon. For example, model generation system 50 can perform permutations such as shuffling data, changing model generation algorithms, etc. Regardless, model generation system 50 will generate the data mining models in parallel (e.g., in a grid-like fashion) such that all data mining models are generated at the same time/simultaneously. This avoids the inefficiency of having to generate each data mining model one at a time. To this extent, although not shown in FIG. 1, multiple computerized “machines” could be provided in communication with server 10, each of which generates one or more data mining models.
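The iterate-and-generate step can be sketched as follows, assuming that each permutation of the model generation details is a pair of (algorithm, shuffle seed) and that a thread pool stands in for the grid of machines. The two threshold-fitting functions, the `generate_model` and `generate_models` names, and the two-thirds training split are all hypothetical; the specification does not name particular algorithms or a particular parallel runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random

# Hypothetical stand-ins for the "model generation algorithms".
def fit_mean_cut(rows):
    """Place the decision threshold at the mean feature value."""
    return sum(x for x, _ in rows) / len(rows)

def fit_median_cut(rows):
    """Place the decision threshold at the (upper) median feature value."""
    xs = sorted(x for x, _ in rows)
    return xs[len(xs) // 2]

ALGORITHMS = {"mean_cut": fit_mean_cut, "median_cut": fit_median_cut}

def generate_model(details, rows):
    """Build one model from one permutation of the generation details."""
    name, seed = details
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)                  # permutation: shuffle the data
    train = shuffled[: max(1, len(shuffled) * 2 // 3)]     # train on a subset of the shuffle
    return {"algorithm": name, "seed": seed, "threshold": ALGORITHMS[name](train)}

def generate_models(rows, seeds=(0, 1, 2), max_workers=4):
    """Generate one model per (algorithm, seed) permutation, in parallel."""
    permutations = list(product(ALGORITHMS, seeds))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda d: generate_model(d, rows), permutations))

# Each row is (feature value, label); labels are unused by these toy algorithms.
user_data = [(1.0, False), (2.0, False), (3.0, False),
             (8.0, True), (9.0, True), (10.0, True)]
models = generate_models(user_data)
print(len(models))
```

Threads are used here only to keep the sketch self-contained; distributing the same `generate_model` calls across separate machines would follow the grid arrangement the text describes more closely.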

Once the customized data mining models have been generated as described, model ranking system 52 will rank the generated models based on how they would address the business parameters 60 (e.g., mining model goals). Once ranked, a predetermined quantity of the ranked data mining models will be provided to user 24. For example, if model generation system 50 generated ten data mining models, model ranking system 52 might rank all ten, but only display the top five to user 24 (although all ten could be presented if so desired). In viewing the presented rankings, user 24 can then use customized model selection system 54 to select one or more of the generated data mining models. After selecting the desired data mining model(s), user 24 will use customized model execution system 56 to execute each selected data mining model against the user data 58. Similar to the generation of the data mining models, the execution of multiple data mining models can be performed in parallel (e.g., in a grid fashion) by multiple machines (not shown) in communication with server 10. In any event, the results of each such execution could be collated and provided to user 24 as output 64 for comparison.
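One way to picture the ranking step is to score every generated model against a numeric model goal and keep only a predetermined quantity. The sketch below assumes the goal is a misclassification cost with separate penalties for false alarms and false dismissals, and that each model is a simple threshold on one feature; the function names, cost weights and model representation are illustrative assumptions rather than details from the specification.

```python
def misclassification_cost(threshold, rows, fp_cost=1.0, fn_cost=5.0):
    """Total cost of a threshold model on labeled rows of (value, label)."""
    cost = 0.0
    for value, label in rows:
        predicted = value > threshold
        if predicted and not label:
            cost += fp_cost      # false alarm
        elif label and not predicted:
            cost += fn_cost      # false dismissal
    return cost

def rank_models(models, rows, top_k=5, **costs):
    """Score every model, sort by ascending cost, return the best top_k."""
    scored = [
        dict(m, cost=misclassification_cost(m["threshold"], rows, **costs))
        for m in models
    ]
    scored.sort(key=lambda m: m["cost"])
    return scored[:top_k]

rows = [(1, False), (2, False), (3, False), (8, True), (9, True), (10, True)]
candidates = [
    {"name": "a", "threshold": 0.5},   # flags everything: three false alarms
    {"name": "b", "threshold": 5.0},   # separates the classes cleanly
    {"name": "c", "threshold": 9.5},   # misses two positives
]
for m in rank_models(candidates, rows, top_k=2):
    print(m["name"], m["cost"])
```

The returned cost values double as comparative data that could be shown next to each ranked model.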

Referring to FIG. 5, a method flow diagram 68 depicting the model generation process provided by customized model system 38 is shown. First step C1 of method 68 is to submit user data and designate business parameters. Second step C2 is to iteratively generate customized data mining models in parallel by making permutations of model generation details (e.g., the user data, business parameters and/or model generation algorithms). Third step C3 is to rank the generated customized data mining models, a predetermined quantity of which are identified to the user in step C4. In step C5, the user will select at least one customized data mining model from the ranking, and then execute the selected data mining model(s) in parallel against the data in step C6.

As indicated above, the present invention provides a user with the capability to generate customized data mining models, and/or select existing data mining models. Referring now to FIG. 6, the existing model system 40 by which existing data mining models are selected is shown. As depicted, existing model system 40 includes model library system 70, existing model selection system 72, existing model execution system 74 and existing model comparison system 76. Similar to generating a customized data mining model, user 24 (FIG. 1) will submit user data 58 and business parameters 60. Once submitted, model library system 70 will access all existing data mining models. Based on the business parameters 60, model library system 70 will then provide user 24 with a library of applicable data mining models 78 as well as comparative data corresponding thereto. For example, if based on user 24's identified mining model goals, model library system 70 identified ten possible existing mining models that would be most applicable to user 24, those ten data mining models would be identified to user 24 in an interface or the like. Along with identifying the data mining models, model library system 70 could display comparative data/statistics about how the mining models performed historically (e.g., number of observations, correct hits, false alarms, false dismissals, correct dismissals, etc.). This type of information could help user 24 in selecting the one or more existing data mining models that are the most appropriate and accurate.
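The library assembly and comparative data described above can be sketched with a small record type, assuming each existing model is tagged with a business taxonomy and problem and carries historical counts of correct hits, false alarms, false dismissals and correct dismissals. The `LibraryModel` class, its field names, the derived accuracy and false-alarm-rate figures, and the sample entries are all hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass
class LibraryModel:
    """Hypothetical library entry with historical performance counts."""
    name: str
    taxonomy: str
    problem: str
    correct_hits: int        # true positives
    false_alarms: int        # false positives
    false_dismissals: int    # false negatives
    correct_dismissals: int  # true negatives

    def comparative_data(self):
        observations = (self.correct_hits + self.false_alarms
                        + self.false_dismissals + self.correct_dismissals)
        negatives = self.false_alarms + self.correct_dismissals
        return {
            "observations": observations,
            "accuracy": ((self.correct_hits + self.correct_dismissals) / observations
                         if observations else 0.0),
            "false_alarm_rate": self.false_alarms / negatives if negatives else 0.0,
        }

def assemble_library(all_models, taxonomy, problem):
    """Keep only the existing models that match the business parameters."""
    return [m for m in all_models if m.taxonomy == taxonomy and m.problem == problem]

catalog = [
    LibraryModel("fraud_tree", "banking", "fraud", 40, 10, 5, 45),
    LibraryModel("fraud_rules", "banking", "fraud", 30, 2, 15, 53),
    LibraryModel("churn_score", "telecom", "churn", 25, 5, 5, 65),
]
for m in assemble_library(catalog, "banking", "fraud"):
    print(m.name, m.comparative_data())
```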

Using existing model selection system 72, user 24 can then select one or more data mining models from the library. Upon such a selection, existing model execution system 74 will execute each selected data mining model on the user data 58 in parallel (e.g., in a grid fashion). After execution, the results from each data mining model can be output to user 24 as output/results 64. At this point, user 24 can compare the analytics across all selected data mining models and make a long-term decision to implement one or more of such models.
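A minimal sketch of executing several selected models against the same user data in parallel and collating the results for comparison might look like the following. It assumes each model is a callable that scores one record and uses a thread pool in place of a grid of machines; the `execute_models` name and the two toy models are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_models(models, user_data, max_workers=4):
    """Run every selected model over the full user data set concurrently.

    `models` maps a model name to a callable that scores one record.
    Returns a mapping of model name to its list of per-record results.
    """
    def run(model):
        return [model(record) for record in user_data]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all models first so they run side by side, then collate.
        futures = {name: pool.submit(run, model) for name, model in models.items()}
        return {name: future.result() for name, future in futures.items()}

# Two hypothetical fraud models applied to transaction amounts.
selected = {
    "over_100": lambda amount: amount > 100,
    "over_500": lambda amount: amount > 500,
}
results = execute_models(selected, [50, 250, 900])
for name, flags in results.items():
    print(name, flags)
```

Keying the collated output by model name keeps the per-model results side by side, which is the form a comparison step would need.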

Referring now to FIG. 7, a method flow diagram 80 depicting the existing model selection process provided by existing model system 40 is shown. First step E1 is to submit user data and designate business parameters. Second step E2 is to assemble the library of existing data mining models, which is displayed along with comparative data in step E3. Next, the user can select at least one of the existing data mining models in step E4. Each selected data mining model will be executed against the user data in parallel in step E5, and the result will be outputted for comparison in step E6.

It should be understood that the present invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer system(s), or other apparatus adapted for carrying out the methods described herein, is suited. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when loaded and executed, carries out the respective methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention, could be utilized. The present invention can also be embedded in a computer program product, which comprises all the respective features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.

The foregoing description of the preferred embodiments of this invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims. For example, although not shown, an authentication/verification system could be provided for user 24 to log in and be authenticated/verified before using analysis system 30. This could especially be the case if the present invention is implemented under a fee-based structure.

1. A computerized data mining system, comprising: a central processing unit; a memory operably associated with the central processing unit; and a data mining system storable in the memory and executable by the central processing unit, the data mining system comprising: a data exploration system for receiving and analyzing user data to provide statistical information about the user data, wherein the statistical information comprises data relationships, data outliers, invalid data values and standard deviations; a customized model system for generating and ranking customized data mining models, and for executing a selected customized data mining model on the user data, wherein the customized data mining models are generated using multiple iterations based on permutations of at least one of the user data, business parameters and a set of model generation algorithms, wherein the business parameters comprise a business taxonomy and a set of model goals, wherein the customized model system comprises: a model generation system for generating the customized data mining models in parallel using multiple iterations based on the permutations of at least one of the user data, the business parameters and the set of model generation algorithms; a model ranking system for ranking the customized data mining models based on the business parameters, for identifying a predetermined quantity of the ranked customized data mining models, and for providing comparative data corresponding to the predetermined quantity of the ranked customized data mining models; a customized model selection system for selecting at least one customized mining model from the predetermined quantity; and a customized model execution system for executing the selected at least one customized data mining model on the user data; and an existing model system for selecting at least one existing data mining model from a library of existing data mining models, and for executing the selected at least one existing data mining model in parallel on the user data, and outputting a result of the executing of the selected at least one customized data mining model to a user.

2. The computerized data mining system of claim 1, further comprising: a data submission system for submitting the user data; and a parameter designation system for designating the business parameters.

3. The computerized data mining system of claim 1, wherein the existing model system comprises: a model library system for assembling the library of existing data mining models based on the business parameters, and for displaying the library of existing data mining models and comparative data corresponding to the library of existing data models; an existing model selection system for selecting the at least one existing data mining model from the library of existing data mining models; an existing model execution system for executing the at least one existing data mining model on the user data in parallel; and an existing model comparison system for comparing results of the execution of the at least one existing data mining model.

4. The computerized data mining system of claim 1, wherein the business parameters comprise a business taxonomy, a set of business problems and a set of model goals.

5. The computerized data mining system of claim 1, wherein the statistical information comprises data relationships, data outliers, invalid data values and standard deviations.

6. The computerized data mining system of claim 1, wherein the computerized data mining system is implemented in a network environment.

7. A computer-readable storage medium storing computer instructions, which when executed, enables a computer system to mine data, the computer instructions comprising: receiving and analyzing user data to provide statistical information about the user data, wherein the statistical information comprises data relationships, data outliers, invalid data values and standard deviations; generating and ranking customized data mining models, and executing a selected customized data mining model on the user data, wherein the customized data mining models are generated using multiple iterations based on permutations of at least one of the user data, business parameters and a set of model generation algorithms, wherein the generating and ranking comprises: generating the customized data mining models in parallel using multiple iterations based on the permutations of at least one of the user data, the business parameters and the set of model generation algorithms, wherein the business parameters comprise a business taxonomy and a set of model goals; ranking the customized data mining models based on the business parameters, identifying a predetermined quantity of the ranked customized data mining models, and providing comparative data corresponding to the predetermined quantity of the ranked customized data mining models; selecting at least one customized mining model from the predetermined quantity; executing the selected at least one customized data mining model on the user data; and selecting at least one existing data mining model from a library of existing data mining models, and executing the selected at least one existing data mining model in parallel on the user data, and outputting a result of the executing of the selected at least one customized data mining model to a user.

8. The computer-readable storage medium storing computer instructions of claim 7, wherein the instructions further comprise: submitting the user data; and designating the business parameters.

9. The computer-readable storage medium storing computer instructions of claim 7, wherein the selecting comprises: assembling the library of existing data mining models based on the business parameters, and displaying the library of existing data mining models and comparative data corresponding to the library of existing data mining models; selecting the at least one existing data mining model from the library of existing data mining models; executing the at least one existing data mining model on the user data in parallel; and comparing results of the execution of the at least one existing data mining model.

10. The computer-readable storage medium storing computer instructions of claim 7, wherein the business parameters comprise a business taxonomy, a set of business problems and a set of model goals.

11. The computer-readable storage medium storing computer instructions of claim 7, wherein the computer instructions are implemented in a network environment.